Use ReadOnlyMemory<byte> for binary data to eliminate UTF-16 transcoding#1070
Use ReadOnlyMemory<byte> for binary data to eliminate UTF-16 transcoding#1070
Conversation
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
src/ModelContextProtocol.Core/Server/AIFunctionMcpServerResource.cs
Outdated
Show resolved
Hide resolved
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
src/ModelContextProtocol.Core/Server/AIFunctionMcpServerResource.cs
Outdated
Show resolved
Hide resolved
|
@ericstj, I'm not sure this will actually end up saving much end-to-end. It'd help if some of the data wasn't actually looked at, but if it ends up being looked at, it seems it'll actually make the end-to-end worse. |
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
|
I see this PR is still going though a string when decoding - I thought we had API that would go straight from ROS (base64 UTF8) to byte[] without an intermediate string. |
We have operation status APIs that write into an existing destinatiom buffer, but no one-shot that creates the array. We do for Base64Url but we didn't square the circle for Base64. |
|
Can't we do Base64.GetMaxDecodedFromUtf8Length to get the size, create a byte array, then Base64.DecodeFromUtf8, followed by slice to final actual size? |
That's the operation status API I was referring to.
I thought you'd stated you wanted the decoded data property to be a byte[]. If it is, slicing would entail allocating a whole new array unless the max happened to equal actual, which would happen only ~33% of the time. If the destination property is a |
This is exactly what I was thinking of when mentioning:
And the current implementation is already exposing as |
Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
Implemented using Base64.DecodeFromUtf8 to decode directly from UTF-8 bytes without string intermediate in commit 39213fb. The implementation:
|
|
@ericstj, is this something you're still interested in pushing forward? |
|
Yes, thanks for the reminder. Let me refresh my memory on it and have a look this week. |
|
Fixing up the merge conflict. |
…om data. These factories go straight from data -> base64 utf-8 encoded data without excess garbage.
…-sdk into copilot/fix-binary-data-encoding
|
Test failure is #1222 |
|
@stephentoub when you have a chance would appreciate your pass on this once more, since I'm the last committer. |
| new BlobResourceContents | ||
| { | ||
| Blob = resource.Description!, | ||
| Blob = System.Text.Encoding.UTF8.GetBytes(resource.Description!), |
There was a problem hiding this comment.
@copilot, we prefer to use simple rather than fully-qualified names. Can these "System.Text"s be moved instead to usings at the beginning?
| // gets UTF-8 bytes from UTF-16 chars without intermediate string allocations | ||
| fixed (char* pChars = utf16) | ||
| { | ||
| var byteCount = System.Text.Encoding.UTF8.GetByteCount(pChars, utf16.Length); | ||
| var bytes = new byte[byteCount]; | ||
|
|
||
| fixed (byte* pBytes = bytes) | ||
| { | ||
| System.Text.Encoding.UTF8.GetBytes(pChars, utf16.Length, pBytes, byteCount); | ||
| } | ||
| return bytes; | ||
| } |
There was a problem hiding this comment.
@copilot, this shouldn't require unsafe code on .NET Core... there are overloads that work with spans, so this can be e.g.
byte[] bytes = new byte[Encoding.UTF8.GetByteCount(utf16)]
Encoding.UTF8.GetBytes(utf16, bytes);
return bytes;For the netstandard2.0 build, you can add polyfills for these methods in https://github.com/modelcontextprotocol/csharp-sdk/tree/main/src/Common/Polyfills/System
There was a problem hiding this comment.
I had used the unsafe code to avoid the ifdef, I agree a polyfill approach should work. Thanks for suggestion.
| throw new FormatException("Invalid base64 data"); | ||
| } | ||
| } | ||
| return _decodedData.Value; |
There was a problem hiding this comment.
| return _decodedData.Value; | |
| return _decodedData.Value; |
| /// </para> | ||
| /// </remarks> | ||
| [JsonIgnore] | ||
| public ReadOnlyMemory<byte> Data |
There was a problem hiding this comment.
@ericstj why do you call this Data here but DecodedData elsewhere like on ContentBlock?
There was a problem hiding this comment.
We can change it.
The change was made here 4287694#diff-9da29ad32a17c90e9b28c79ad3ab9c4b370a757e26657c736020ea47314a60ba
I had assumed we would have liked to use Data everywhere, since that's the better name and more-likely the "useful" property for the caller. The ContentBlock types had already used Data to mean the encoded data, so we didn't change that.
I agree that consistency is better here - lets use DecodedData here. What are your thoughts on Blob vs Data for the encoded data? Should we keep those different between content and resource types (since it's preexisting discrepancy) or try to make the consistent?
There was a problem hiding this comment.
Should we keep those different between content and resource types (since it's preexisting discrepancy) or try to make the consistent?
I think there's value in keeping the names of these C# properties aligned with their spec / JSON schema counterparts. I don't know why they differ in the MCP spec, but alas.
There was a problem hiding this comment.
Ok, we'll keep them different, but rename Data to DecodedData here.
|
@copilot please address feedback |
…sed encoding, remove System.Text qualifiers Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
Addressed all feedback in commit 0a3015f:
|
| /// the Model Context Protocol. The binary data is represented as a base64-encoded string | ||
| /// in the <see cref="Blob"/> property. | ||
| /// the Model Context Protocol. The binary data is represented as base64-encoded UTF-8 bytes | ||
| /// in the <see cref="Blob"/> property, providing a zero-copy representation of the wire payload. |
There was a problem hiding this comment.
It's still copied. It just doesn't have its encoding changed.
There was a problem hiding this comment.
True, we don't control the buffering during serialization or to network payload. This type itself won't copy when initialized with pre-encoded bytes. Hypothetically something using the type could slot these bytes into a a sequence of ROM for a zero-copy solution.
I agree that this is not really a useful doc change though. This is just a case of copilot making the rookie mistake of commenting on the diff instead of the result.
We should remove all mention of zero-copy from this change.
| byte[] buffer = new byte[maxLength]; | ||
| if (Base64.EncodeToUtf8(data.Span, buffer, out _, out int bytesWritten) == OperationStatus.Done) | ||
| { | ||
| Debug.Assert(bytesWritten == buffer.Length, "Base64 encoding should always produce exact length for non-padded input"); |
There was a problem hiding this comment.
| Debug.Assert(bytesWritten == buffer.Length, "Base64 encoding should always produce exact length for non-padded input"); | |
| Debug.Assert(bytesWritten == buffer.Length, "Base64 encoding should always produce the same length as the max length"); |
| /// Gets or sets the base64-encoded UTF-8 bytes representing the binary data of the item. | ||
| /// </summary> | ||
| /// <remarks> | ||
| /// This is a zero-copy representation of the wire payload of this item. Setting this value will invalidate any cached value of <see cref="DecodedData"/>. |
There was a problem hiding this comment.
The "zero-copy representation of the wire payload" part of this is confusing.
| throw new FormatException("Invalid base64 data"); | ||
| } | ||
| } | ||
| return _decodedData.Value; |
There was a problem hiding this comment.
| return _decodedData.Value; | |
| return _decodedData.Value; |
| /// <param name="mimeType">The MIME type of the audio.</param> | ||
| /// <returns>A new <see cref="AudioContentBlock"/> instance.</returns> | ||
| /// <exception cref="InvalidOperationException"></exception> | ||
| public static AudioContentBlock FromAudio(ReadOnlyMemory<byte> audioData, string mimeType) |
There was a problem hiding this comment.
I'm not sure about the FromImage and FromAudio naming. These are already _Image_ContentBlock and _Audio_ContentBlock. Is the word in the name duplicative? Should these be FromBytes or FromRawData or something like that? If we can't come up with anything better, we can go with the current names, it's just a little strange to me, both with the duplication and with this saying it's from audio/images but the arguments are arbitrary bytes/mimeType, and there's no validation that it's actually audio/image.
There was a problem hiding this comment.
I'm good with FromBytes. I was thinking FromData didn't work because of existing Data properties. Let's rename to FromBytes and mention in comments that it represents decoded bytes and using these methods will retain a reference of the provided bytes as the DecodedData and will encode them to Data.
| // Encode directly to UTF-8 base64 bytes without string intermediate | ||
| int maxLength = Base64.GetMaxEncodedToUtf8Length(audioData.Length); | ||
| byte[] buffer = new byte[maxLength]; | ||
| if (Base64.EncodeToUtf8(audioData.Span, buffer, out _, out int bytesWritten) == OperationStatus.Done) | ||
| { | ||
| data = buffer.AsMemory(0, bytesWritten); | ||
| } | ||
| else | ||
| { | ||
| throw new InvalidOperationException("Failed to encode binary data to base64"); | ||
| } |
There was a problem hiding this comment.
Is this exact code duplicated? Can we dedup into a helper function? Same for the other direction.
| throw new FormatException("Invalid base64 data"); | ||
| } | ||
| } | ||
| return _decodedData.Value; |
There was a problem hiding this comment.
| return _decodedData.Value; | |
| return _decodedData.Value; |
| Blob = MemoryMarshal.TryGetArray(dc.Base64Data, out ArraySegment<char> segment) | ||
| ? Encoding.UTF8.GetBytes(segment.Array!, segment.Offset, segment.Count) | ||
| : Encoding.UTF8.GetBytes(dc.Base64Data.ToString()) |
There was a problem hiding this comment.
This and the below can also be deduped into a helper function for getting an array from a DataContent
|
@copilot please address feedback, and be sure to consider variants of each suggestion. |
…dio to FromBytes, add Base64 helpers, fix Encoding references Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
Addressed all feedback in commit e4f2816:
|
Plan: Fix Binary Data Representation in Protocol Types
Based on the issue analysis, I need to change how binary data is represented in the protocol types to avoid unnecessary UTF-16 transcoding.
Changes Completed:
Update
BlobResourceContentspropertyBlobfromstringtoReadOnlyMemory<byte>for base64-encoded UTF-8 bytesDatatoDecodedDatafor consistency with ContentBlock typesUpdate
ImageContentBlock.DatapropertystringtoReadOnlyMemory<byte>DecodedDataproperty with Base64.DecodeFromUtf8 (no string intermediate)Update
AudioContentBlock.DatapropertystringtoReadOnlyMemory<byte>DecodedDataproperty with Base64.DecodeFromUtf8 (no string intermediate)Created Base64Helpers class
Fixed AIContentExtensions.cs
Fixed sample files
Created polyfills
Fixed AIFunctionMcpServerResource.cs
Updated all tests
Build and test
Summary:
Successfully converted binary data properties from
stringtoReadOnlyMemory<byte>with complete elimination of UTF-16 transcoding:Original prompt
💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.